Position Automata for Kleene Algebra with Tests


Alexandra SILVA


Abstract 


Kleene algebra with tests (KAT) is an equational system that com- 
bines Kleene and Boolean algebras. One can model basic programming 
constructs and assertions in KAT, which allows for its application in 
compiler optimization, program transformation and dataflow analysis. 
To provide semantics for KAT expressions, Kozen first introduced
automata on guarded strings, showing that the regular sets of guarded
strings play the same role in KAT as regular languages play in Kleene
algebra. Recently, Kozen described an elegant algorithm, based on 
“derivatives”, to construct a deterministic automaton that accepts the 
guarded strings denoted by a KAT expression. This algorithm gener- 
alizes Brzozowski’s algorithm for regular expressions and inherits its 
inefficiency arising from the explicit computation of derivatives. 

In the context of classical regular expressions, many efficient algo- 
rithms for compiling expressions to automata have been proposed. One 
of those algorithms was devised by Berry and Sethi in the 80’s (we shall 
refer to it as Berry-Sethi construction/algorithm, but in the literature 
it is also referred to as position or Glushkov automata algorithm). 

In this paper, we show how the Berry-Sethi algorithm can be 
used to compile a KAT expression to an automaton on guarded strings. 
Moreover, we propose a new automata model for KAT expressions and 
adapt the construction of Berry and Sethi to this new model. 
1 Introduction


Efficient algorithms for compiling regular expressions to deterministic and 
non-deterministic automata became crucial when regular expressions started 
to be widely used for pattern matching. One of the most efficient algorithms 
to translate regular expressions into automata was devised by Berry and 
Sethi [4]. 

Kleene algebra with tests (KAT) is an equational system which extends 
Kleene algebra, the algebra of regular expressions, with a Boolean algebra 
of tests. KAT expressions are simply regular expressions where the alphabet
is two-sorted: it contains actions and tests. The latter must satisfy the
axioms of Boolean algebra. This seemingly simple extension yields a
powerful language in which basic programming constructs, such as while
loops, conditional tests, Hoare triples, and goto statements, can be
modeled and verified. KAT has successfully been applied in low-level
verification tasks involving program transformation, compiler optimization
and dataflow analysis [15, 1, 13].

The purpose of this paper is twofold. On the one hand, we observe 
that the Berry-Sethi algorithm can be directly applied to compile a KAT 
expression into an equivalent automaton on guarded strings, the original 
semantic model for KAT proposed by Kozen in [12]. On the other hand, we 
propose a new automaton model to provide semantics for KAT and adapt the 
Berry-Sethi algorithm to the new model. 

The paper is organized as follows. In Section 2, we start with recalling 
the main concepts we need from the coalgebraic theory of (non-)deterministic 
automata and regular expressions. In Section 3, we recall the Berry-Sethi 
algorithm, from classical regular expressions to non-deterministic automata. 
In Section 4, we introduce the basics of Kleene algebra with tests (KAT), 
an extension of the algebra of regular expressions with Boolean tests, and 
automata on guarded strings (both the deterministic and non-deterministic 
versions). Section 5 contains the main results of the present paper: we show 
that the Berry-Sethi construction can be applied directly to KAT yielding a 
non-deterministic automaton on guarded strings. Moreover, we introduce a 
new automaton model for KAT which can be regarded as a compromise be- 
tween the non-deterministic and deterministic versions of Kozen’s automata, 
and adapt the Berry-Sethi construction to compile a KAT expression into 
this new automata model. The main advantage of the new model is that 
it will have a smaller number of states than the corresponding automaton 
on guarded strings. This is important in certain areas of application of KAT, 
such as verification or model checking of properties.


2 Deterministic Automata and Regular Expres- 
sions, Coalgebraically 


In this section, we introduce basic notions and notation on deterministic au- 
tomata and regular expressions. Our presentation is based on the coalgebraic 
view on automata [18, 19, 20]. 


2.1 Coalgebra 


We will use three notions from coalgebra which we introduce upfront. 

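Let $\mathbf{Set}$ be the category of sets and functions. An $F$-coalgebra is a
pair $(S, f\colon S \to F(S))$, where $S$ is a set of states and
$F\colon \mathbf{Set} \to \mathbf{Set}$ is a functor. The functor $F$, together with
the function $f$, determines the transition structure (or dynamics) of the
$F$-coalgebra [19].

An $F$-homomorphism $h\colon (S,f) \to (T,g)$, from an $F$-coalgebra $(S,f)$
to an $F$-coalgebra $(T,g)$, is a function $h\colon S \to T$ preserving the
transition structure, i.e., such that the following diagram commutes:

\[
\begin{array}{ccc}
S & \xrightarrow{\;h\;} & T \\
{\scriptstyle f}\downarrow & & \downarrow{\scriptstyle g} \\
F(S) & \xrightarrow{\;F(h)\;} & F(T)
\end{array}
\qquad g \circ h = F(h) \circ f
\]

An $F$-coalgebra $(\Omega, \omega)$ is said to be final if for any $F$-coalgebra
$(S,f)$ there exists a unique $F$-homomorphism
$beh_S\colon (S,f) \to (\Omega,\omega)$:

\[
\begin{array}{ccc}
S & \xrightarrow{\;beh_S\;} & \Omega \\
{\scriptstyle f}\downarrow & & \downarrow{\scriptstyle \omega} \\
F(S) & \xrightarrow{\;F(beh_S)\;} & F(\Omega)
\end{array}
\qquad \omega \circ beh_S = F(beh_S) \circ f
\]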


The notion of finality will play a key role later in providing semantics to 
automata and regular expressions. The functor(s) we consider in the rest of 
the paper are part of a class for which final coalgebras exist. 


370 A. Silva 


2.2 (Non-)Deterministic Automata, Coalgebraically 


Let $A$ be a set of input letters (or symbols). A deterministic automaton
with inputs in $A$ is a pair $(S, (o_S, t_S))$ consisting of a set of states
$S$ and a pair of functions $(o_S, t_S)$, where $o_S\colon S \to 2$ is the output
function, which determines if a state $s$ is final ($o_S(s) = 1$) or not
($o_S(s) = 0$), and $t_S\colon S \to S^A$ is the transition function², which,
given an input letter $a$, determines the next state. We will frequently
write $s_a$ to denote $t_S(s)(a)$ and refer to $s_a$ as the derivative of $s$
for input $a$. Moreover, when depicting deterministic automata we will draw
a single circle around non-final states and a double circle around final ones.

We illustrate the notation we will use in the representation of determin- 
istic automata in the following example. 


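[Figure: deterministic automaton with states $s_1$ and $s_2$]

\[
\begin{array}{ll}
o_S(s_1) = 0 & o_S(s_2) = 1 \\
(s_1)_a = s_2 & (s_1)_b = s_1 \\
(s_2)_a = s_1 & (s_2)_b = s_2
\end{array}
\]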


Deterministic automata are coalgebras for the functor $D(X) = 2 \times X^A$.
The classical notion of automata homomorphism instantiates precisely
to the definition of coalgebra homomorphism for the functor $D$: given two
deterministic automata $(S, (o_S, t_S))$ and $(T, (o_T, t_T))$, a function
$h\colon S \to T$ is a homomorphism if it preserves outputs and input
derivatives, that is $o_T(h(s)) = o_S(s)$ and
$h(s)_a = t_T(h(s))(a) = h^A(t_S(s))(a) = h(s_a)$, for all $a \in A$. These
equations correspond to the commutativity of the following diagram.


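\[
\begin{array}{ccc}
S & \xrightarrow{\;h\;} & T \\
{\scriptstyle (o_S,t_S)}\downarrow & & \downarrow{\scriptstyle (o_T,t_T)} \\
2 \times S^A & \xrightarrow{\;id \times h^A\;} & 2 \times T^A
\end{array}
\]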


² Here, and in the sequel, we represent the transition function of an
automaton curried. Alternatively, and equivalently, the function could be
represented as $t\colon S \times A \to S$ or $t\colon S \to (A \to S)$. The curried
representation is more amenable to the coalgebraic treatment of automata.

The input derivative $s_a$ of a state $s$ for input $a \in A$ can be extended
to the word derivative $s_w$ of a state $s$ for input $w \in A^*$ by defining,
by induction on the length of $w$, $s_\epsilon = s$ and $s_{aw'} = (s_a)_{w'}$,
where $\epsilon$ denotes the empty word and $aw'$ the word obtained by
prefixing $w'$ with the letter $a$. This enables an easy definition of the
semantics of a state $s$ of a deterministic automaton: the language
$L(s) \in 2^{A^*}$ of a state $s$ is given by the following characteristic
function:

\[
L(s)(w) = o_S(s_w) \qquad (1)
\]


The language recognized by $s$ contains all words $w$ for which $L(s)(w) = 1$.
For instance, the language recognized by state $s_1$ of the automaton above
is the set of all words with an odd number of $a$'s. It is easy to check
that, for example, $L(s_1)(bab) = o_S((s_1)_{bab}) = o_S(s_2) = 1$ and
$L(s_1)(aab) = o_S((s_1)_{aab}) = o_S(s_1) = 0$.
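Equation (1) can be read as a small program. The following is a minimal sketch, assuming the two-state example automaton is encoded by the hypothetical tables `OUTPUT` and `STEP` (these names are illustrative and not part of the original text): the word derivative is a fold of the transition function over the letters of the word, and the language is the output of the state reached.

```python
from typing import Dict, Tuple

# Output and transition tables of the two-state example automaton
# (assumed encoding: states are strings, letters are single characters).
OUTPUT: Dict[str, int] = {"s1": 0, "s2": 1}
STEP: Dict[Tuple[str, str], str] = {
    ("s1", "a"): "s2", ("s1", "b"): "s1",
    ("s2", "a"): "s1", ("s2", "b"): "s2",
}


def word_derivative(state: str, word: str) -> str:
    """Compute s_w by applying the input derivative letter by letter."""
    for letter in word:
        state = STEP[(state, letter)]
    return state


def language(state: str, word: str) -> int:
    """Characteristic function L(s)(w) = o_S(s_w) from equation (1)."""
    return OUTPUT[word_derivative(state, word)]
```

Under this encoding, `language("s1", "bab")` and `language("s1", "aab")` reproduce the two values computed in the text.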

Given two deterministic automata $(S, (o_S, t_S))$ and $(T, (o_T, t_T))$, a
relation $R \subseteq S \times T$ is a bisimulation if $(s,t) \in R$ implies

\[
o_S(s) = o_T(t) \quad\text{and}\quad (s_a, t_a) \in R \text{ for all } a \in A
\]


We will write s ~ t whenever there exists a bisimulation R containing (s, t). 
This concrete definition of bisimulation can be recovered as a special case of 
the general definition of bisimulation for F-coalgebras [19] by instantiating 
the functor $F$ to $D(X) = 2 \times X^A$. The following theorem guarantees that
the definition above is a valid proof principle for language equivalence of 
deterministic automata. We omit the proof here, for details see [18]. 


Theorem 1 (Coinduction) Given two deterministic automata $(S, (o_S, t_S))$
and $(T, (o_T, t_T))$, $s \in S$ and $t \in T$:

\[
s \sim t \Rightarrow L(s) = L(t)
\]


To determine whether two states s and t of two deterministic automata 
$(S, (o_S, t_S))$ and $(T, (o_T, t_T))$ (over the same alphabet) recognize the same
language we can now use coinduction: it is enough to construct a bisimulation 
containing (s,t). 


Example 1 Let $(S, (o_S, t_S))$ and $(T, (o_T, t_T))$ be the deterministic automata
over the alphabet $\{a,b\}$ given by




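The relation $R = \{(s_1,t_1), (s_2,t_1), (s_3,t_2)\}$ is a bisimulation:

\[
o_S(s_1) = o_S(s_2) = 0 = o_T(t_1) \qquad o_S(s_3) = 1 = o_T(t_2)
\]

\[
\begin{array}{ll}
(s_1)_a = s_2 \mathrel{R} t_1 = (t_1)_a & (s_1)_b = s_3 \mathrel{R} t_2 = (t_1)_b \\
(s_2)_a = s_2 \mathrel{R} t_1 = (t_1)_a & (s_2)_b = s_3 \mathrel{R} t_2 = (t_1)_b \\
(s_3)_a = s_3 \mathrel{R} t_2 = (t_2)_a & (s_3)_b = s_3 \mathrel{R} t_2 = (t_2)_b
\end{array}
\]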
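Checking that a finite relation is a bisimulation is mechanical: for each pair, compare the outputs and verify that the derivatives for every letter are again related. The sketch below assumes the transition tables `STEP_S` and `STEP_T`, which are one reading of the equations of Example 1; the table names and the function `is_bisimulation` are hypothetical.

```python
def is_bisimulation(relation, out_s, step_s, out_t, step_t, alphabet):
    """Check both bisimulation conditions for every pair in the relation."""
    for s, t in relation:
        # Related states must have the same output.
        if out_s[s] != out_t[t]:
            return False
        # Their derivatives must be related again, for every input letter.
        for a in alphabet:
            if (step_s[(s, a)], step_t[(t, a)]) not in relation:
                return False
    return True


# Assumed encoding of the two automata of Example 1.
OUT_S = {"s1": 0, "s2": 0, "s3": 1}
STEP_S = {
    ("s1", "a"): "s2", ("s1", "b"): "s3",
    ("s2", "a"): "s2", ("s2", "b"): "s3",
    ("s3", "a"): "s3", ("s3", "b"): "s3",
}
OUT_T = {"t1": 0, "t2": 1}
STEP_T = {
    ("t1", "a"): "t1", ("t1", "b"): "t2",
    ("t2", "a"): "t2", ("t2", "b"): "t2",
}
R = {("s1", "t1"), ("s2", "t1"), ("s3", "t2")}
```

With these tables, `is_bisimulation(R, OUT_S, STEP_S, OUT_T, STEP_T, "ab")` holds, whereas a relation pairing a non-final state with a final one, such as `{("s1", "t2")}`, is rejected at the output check.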


The language recognized by a state $s$ is the behavior (or semantics) of
$s$. Thus, the set of languages $2^{A^*}$ over $A$ can be thought of as the
universe of all possible behaviors for deterministic automata. We now turn
$2^{A^*}$ into a deterministic automaton (with inputs in $A$) and then show
that such an automaton has the universal property of being final, which will
connect the coalgebraic semantics induced by the functor with the classical
language semantics we have just presented.

For an input letter $a \in A$, the input derivative $K_a$ of a language
$K \in 2^{A^*}$ on input $a$ is defined by $K_a(w) = K(aw)$. The output of $K$
is defined by $K(\epsilon)$. These notions determine a deterministic automaton
$(2^{A^*}, (o_L, t_L))$ defined, for $K \in 2^{A^*}$ and $a \in A$, by
$o_L(K) = K(\epsilon)$ and $t_L(K)(a) = K_a$.


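Theorem 2 The automaton $(2^{A^*}, (o_L, t_L))$ is final. That is, for any
deterministic automaton $(S, (o_S, t_S))$, $L\colon S \to 2^{A^*}$ is the unique
homomorphism which makes the following diagram commute.

\[
\begin{array}{ccc}
S & \xrightarrow{\;L\;} & 2^{A^*} \\
{\scriptstyle (o_S,t_S)}\downarrow & & \downarrow{\scriptstyle (o_L,t_L)} \\
2 \times S^A & \xrightarrow{\;id \times L^A\;} & 2 \times (2^{A^*})^A
\end{array}
\]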


Given a state s, the language L(s) is precisely the language recognized by s 
(as defined in equation (1)). 


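Proof: We have to prove that the diagram commutes and that $L$ is unique.
First the commutativity:

\[
\begin{array}{l}
o_L(L(s)) = L(s)(\epsilon) = o_S(s) \\
t_L(L(s))(a) = L(s)_a = \lambda w. L(s)(aw) = \lambda w. L(s_a)(w) = L(s_a) = L^A(t_S(s))(a)
\end{array}
\]

For the step $\lambda w. L(s)(aw) = \lambda w. L(s_a)(w)$, note that, by
definition of $L$ (equation (1)),

\[
L(s)(aw) = o_S(s_{aw}) = o_S((s_a)_w) = L(s_a)(w)
\]

For the uniqueness, suppose there is another morphism $h\colon S \to 2^{A^*}$
such that, for every $s \in S$ and $a \in A$, $o_L(h(s)) = o_S(s)$ and
$h(s)_a = h(s_a)$. We prove by induction on the length of words $w \in A^*$
that $h(s)(w) = L(s)(w)$ for all $s \in S$, and hence $h(s) = L(s)$:

\[
\begin{array}{l}
h(s)(\epsilon) = o_L(h(s)) = o_S(s) = L(s)(\epsilon) \\
h(s)(aw) = h(s)_a(w) = h(s_a)(w) \overset{(IH)}{=} L(s_a)(w) = L(s)(aw)
\end{array}
\]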


The semantics induced by the unique map L into the final coalgebra coincides, 
in the case of deterministic automata, with the bisimulation semantics we 
defined above, that is $s \sim t \Leftrightarrow L(s) = L(t)$ (the implication in Theorem 1 is
actually an equivalence). 

Moore automata are a slight variation on deterministic automata. More 
precisely, the codomain of the output function is changed from 2 to B. 
Formally, a Moore automaton with inputs in A and outputs in B is a pair 
$(S, (o, t))$ where $S$ is a set of states, $o\colon S \to B$ is the output function and
$t\colon S \to S^A$ is the transition function.

Similarly to the change on the output function, the carrier set of the
final coalgebra for Moore automata is $B^{A^*}$ (in contrast to $2^{A^*}$), and
the coalgebra structure $(o_M, t_M)\colon B^{A^*} \to B \times (B^{A^*})^A$ is
defined similarly:

\[
o_M(f) = f(\epsilon) \qquad t_M(f)(a)(w) = f(aw)
\]

and we have the following finality result (proof similar to Theorem 2).


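Theorem 3 The automaton $(B^{A^*}, (o_M, t_M))$ is final. That is, for any
Moore automaton $(S, (o_S, t_S))$, there is a unique homomorphism which makes
the following diagram commute.

\[
\begin{array}{ccc}
S & \xrightarrow{\qquad} & B^{A^*} \\
{\scriptstyle (o_S,t_S)}\downarrow & & \downarrow{\scriptstyle (o_M,t_M)} \\
B \times S^A & \xrightarrow{\qquad} & B \times (B^{A^*})^A
\end{array}
\]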


A non-deterministic automaton (NDA) is similar to a deterministic automa- 
ton but the transition function gives a set of next-states for each input letter 
instead of a single state. NDA’s often provide more compact representations
of regular languages than deterministic automata. For this reason, they are
computationally very interesting and much research has been devoted to
constructions compiling a regular expression into an NDA [2, 8, 4, 21, 16, 7]
(we will show an example of such a construction below). Surprisingly, as far
as language acceptance is concerned, NDA’s are not more powerful than
deterministic automata. A state s of an NDA accepts a word w if there is a
path starting in s labeled by w which leads to a final state. For every NDA there exists
a deterministic automaton with a state equivalent to a given state of the 
NDA. Such deterministic automaton can be obtained from a given NDA by 
the so-called subset (or powerset) construction first introduced by Rabin 
and Scott [17], which we will show below coalgebraically. 

Formally, an NDA over the input alphabet $A$ is a pair $(S, (o, \delta))$, where
$S$ is a set of states and $(o, \delta)\colon S \to 2 \times (\mathcal{P}(S))^A$ is a
function pair with $o$ as before and where $\delta$ determines for each input
letter $a$ a finite set of possible next states.

As an example of the compactness of NDA’s, consider the following 
regular language (taken from [11]): 


{w € {a, b}* | the fifth symbol from the right is a} 


One can intuitively construct an NDA with a state $s_1$, having two outgoing
$a$-transitions, which recognizes this language (which could be, for instance,
denoted by the regular expression $(a+b)^* a (a+b)(a+b)(a+b)(a+b)$):

[Figure: NDA recognizing this language, with transitions labeled a,b]

A deterministic automaton recognizing the same language will have at least
$2^5 = 32$ states.

In order to formally compute the language recognized by a state $x$ of
an NDA $\mathcal{A}$, it is usual to first determinize it, constructing a
deterministic automaton $det(\mathcal{A})$ where the state space is
$\mathcal{P}(S)$, and then compute the language recognized by the state $\{x\}$
of $det(\mathcal{A})$. Next, we describe in coalgebraic terms how to construct
the automaton $det(\mathcal{A})$ [19].

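Given an NDA $\mathcal{A} = (S, (o, \delta))$, we construct
$det(\mathcal{A}) = (\mathcal{P}(S), (\bar{o}, \bar{\delta}))$, where, for all
$Y \in \mathcal{P}(S)$, $a \in A$, the functions
$\bar{o}\colon \mathcal{P}(S) \to 2$ and
$\bar{\delta}\colon \mathcal{P}(S) \to \mathcal{P}(S)^A$ are

\[
\bar{o}(Y) = \begin{cases} 1 & \text{if } \exists_{y \in Y}\, o(y) = 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\bar{\delta}(Y)(a) = \bigcup_{y \in Y} \delta(y)(a).
\]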


The automaton $det(\mathcal{A})$ is such that the language $L(\{x\})$ recognized
by $\{x\}$ is the same as the one recognized by $x$ in the original NDA
$\mathcal{A}$ (more generally, the language recognized by state
$X \in \mathcal{P}(S)$ of $det(\mathcal{A})$ is the union of the languages
recognized by each state $x \in X$ of $\mathcal{A}$).
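The subset construction can be sketched in a few lines. This is an illustration under assumptions, not the construction exactly as defined: it builds only the part of the determinized automaton reachable from a start singleton (the definition takes the full powerset as state space), and the names `determinize`, `accepts` and `DELTA` are hypothetical. The example NDA encodes the language "the fifth symbol from the right is a", with states numbered 1 to 6 and state 6 final.

```python
def determinize(start, alphabet, final, delta):
    """Build the reachable part of the determinized automaton from {start}."""
    initial = frozenset([start])
    states = {initial}
    trans = {}
    todo = [initial]
    while todo:
        current = todo.pop()
        for a in alphabet:
            # The transition of a set is the union of its members' transitions.
            target = frozenset(z for y in current for z in delta.get((y, a), ()))
            trans[(current, a)] = target
            if target not in states:
                states.add(target)
                todo.append(target)
    # A set of states is final if it contains at least one final state.
    outputs = {Y: int(any(y in final for y in Y)) for Y in states}
    return initial, states, outputs, trans


def accepts(word, initial, outputs, trans):
    """Run the determinized automaton on a word."""
    current = initial
    for a in word:
        current = trans[(current, a)]
    return outputs[current] == 1


# Assumed encoding of an NDA for (a+b)*a(a+b)(a+b)(a+b)(a+b).
DELTA = {(1, "a"): {1, 2}, (1, "b"): {1}}
for q in range(2, 6):
    DELTA[(q, "a")] = {q + 1}
    DELTA[(q, "b")] = {q + 1}
```

For this example the reachable part has 32 states, matching the lower bound mentioned above, since each reachable subset records which of the last five letters read were a.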

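We summarize the situation above with the following commuting diagram:

\[
\begin{array}{ccccc}
S & \xrightarrow{\;\{\cdot\}\;} & \mathcal{P}(S) & \xrightarrow{\;L\;} & 2^{A^*} \\
{\scriptstyle (o,\delta)}\downarrow & \swarrow{\scriptstyle (\bar{o},\bar{\delta})} & & & \downarrow{\scriptstyle (o_L,t_L)} \\
2 \times \mathcal{P}(S)^A & & \xrightarrow{\;id \times L^A\;} & & 2 \times (2^{A^*})^A
\end{array}
\]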


We note that the language semantics of NDA’s, presented in the above
diagram, can alternatively be achieved by using $\lambda$-coinduction [3, 9].


2.3 Regular Expressions 


We will now recall the basic definitions and results on regular expressions.
The set $R(A)$ of regular expressions over a finite input alphabet $A$ is given
by the following syntax:

\[
r, r_1, r_2 ::= 1 \mid 0 \mid a \in A \mid r_1 + r_2 \mid r_1 r_2 \mid r^*
\]


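The semantics of regular expressions is given in terms of languages³ and it is
defined as a map $L\colon R(A) \to \mathcal{P}(A^*)$ by induction on the syntax as
follows:

\[
\begin{array}{lll}
L(1) = \{\epsilon\} & L(0) = \emptyset & L(a) = \{a\} \\
L(r_1 + r_2) = L(r_1) \cup L(r_2) & L(r_1 r_2) = L(r_1) \cdot L(r_2) & L(r^*) = L(r)^*
\end{array}
\qquad (2)
\]

where, given languages $l$, $l_1$ and $l_2$,
$l_1 \cdot l_2 = \{w_1 w_2 \mid w_1 \in l_1 \text{ and } w_2 \in l_2\}$;
$l^* = \bigcup_{n \in \mathbb{N}} l^n$, and, for $n \in \mathbb{N}$, $l^n$ is
inductively defined by $l^0 = \{\epsilon\}$ and $l^{n+1} = l \cdot l^n$.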
Here, we have intentionally reused L(r) to represent the language 
denoted by a regular expression r € R(A) (recall that we had used L(s) to 
represent the language recognized by a state s of a deterministic automaton). 
This is because we know from Kleene’s theorem that finite deterministic 
automata recognize precisely the languages denoted by regular expressions. 
³ Here, we represent languages as subsets of $A^*$, rather than as functions
$A^* \to 2$. Although we prefer the latter view on languages, the traditional
semantics of regular expressions was presented as sets of words and we recall
it here unchanged. We will only use the set interpretation on languages when
referring to the classical semantics of regular expressions.

We now equip the set $R(A)$ with a deterministic automaton structure.
This definition was first proposed by Brzozowski in his paper Derivatives of
regular expressions [5] and, for that reason, it is occasionally referred to as
Brzozowski derivatives. We define the output $o_R(r)$ of a regular expression
$r$ by

\[
\begin{array}{ll}
o_R(0) = 0 & o_R(r_1 + r_2) = o_R(r_1) \vee o_R(r_2) \\
o_R(1) = 1 & o_R(r_1 r_2) = o_R(r_1) \wedge o_R(r_2) \\
o_R(a) = 0 & o_R(r^*) = 1
\end{array}
\]


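and the input derivative $t_R(r)(a) = r_a$ by

\[
\begin{array}{ll}
(0)_a = 0 & (r_1 + r_2)_a = (r_1)_a + (r_2)_a \\[4pt]
(1)_a = 0 & (r_1 r_2)_a = \begin{cases} (r_1)_a r_2 & \text{if } o_R(r_1) = 0 \\ (r_1)_a r_2 + (r_2)_a & \text{otherwise} \end{cases} \\[4pt]
(a)_{a'} = \begin{cases} 1 & \text{if } a = a' \\ 0 & \text{if } a \neq a' \end{cases} & (r^*)_a = r_a r^*
\end{array}
\]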


In the definition of $o_R$ we use the fact that $2 = \{0,1\}$ can be given a
lattice structure $(\{0,1\}, \vee, \wedge, 0, 1)$ ($0$ is neutral with respect
to $\vee$ and $1$ with respect to $\wedge$).

Intuitively, for a regular expression $r$, $o_R(r) = 1$ if the language denoted
by $r$ contains the empty word $\epsilon$ and $o_R(r) = 0$ otherwise. The regular
expression $r_a$ denotes the language containing all words $w$ such that $aw$ is
in the language denoted by $r$.

Similarly to what happened in deterministic automata, the input derivative
$r_a$ of a regular expression $r$ for input $a$ can be extended to the word
derivative $r_w$ of $r$ for input $w \in A^*$ by defining $r_\epsilon = r$ and
$r_{aw} = (r_a)_w$.
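The output function and the input derivative translate directly into a recursive program. The sketch below assumes a hypothetical encoding of regular expressions as nested tuples (`("0",)`, `("1",)`, `("sym", a)`, `("+", r1, r2)`, `(".", r1, r2)`, `("*", r)`); it performs no simplification modulo associativity, commutativity and idempotency, so derivatives grow syntactically, which is harmless for membership tests on short words.

```python
ZERO = ("0",)
ONE = ("1",)


def out(r):
    """Output o_R(r): 1 iff the language of r contains the empty word."""
    tag = r[0]
    if tag in ("0", "sym"):
        return 0
    if tag in ("1", "*"):
        return 1
    if tag == "+":
        return out(r[1]) | out(r[2])
    return out(r[1]) & out(r[2])  # concatenation


def deriv(r, a):
    """Input derivative r_a, following the defining equations."""
    tag = r[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "sym":
        return ONE if r[1] == a else ZERO
    if tag == "+":
        return ("+", deriv(r[1], a), deriv(r[2], a))
    if tag == ".":
        left = (".", deriv(r[1], a), r[2])
        return left if out(r[1]) == 0 else ("+", left, deriv(r[2], a))
    return (".", deriv(r[1], a), r)  # star: (r*)_a = r_a r*


def matches(r, word):
    """w is in L(r) iff o_R(r_w) = 1, with r_w computed letter by letter."""
    for a in word:
        r = deriv(r, a)
    return out(r) == 1


# The expression (ab + b)*ba in the assumed tuple encoding.
AB_B_STAR_BA = (".",
                ("*", ("+", (".", ("sym", "a"), ("sym", "b")), ("sym", "b"))),
                (".", ("sym", "b"), ("sym", "a")))
```

For instance, `matches(AB_B_STAR_BA, "abba")` holds while `matches(AB_B_STAR_BA, "ab")` does not.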

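We have now defined a deterministic automaton $(R(A), (o_R, t_R))$ and
thus, by Theorem 2, we have a unique map $L$ which makes the following
diagram commute.

\[
\begin{array}{ccc}
R(A) & \xrightarrow{\;L\;} & 2^{A^*} \\
{\scriptstyle (o_R,t_R)}\downarrow & & \downarrow{\scriptstyle (o_L,t_L)} \\
2 \times (R(A))^A & \xrightarrow{\;id \times L^A\;} & 2 \times (2^{A^*})^A
\end{array}
\qquad (3)
\]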


We now prove that, for any $r \in R(A)$, the semantics defined inductively
in (2) is the same as the one given by the unique map into the final coalgebra
$L\colon R(A) \to 2^{A^*}$.
Theorem 4 For all $r \in R(A)$ and $w \in A^*$,

\[
w \in L(r) \Leftrightarrow o_R(r_w) = 1
\]

where $L(r)$ is the inductively defined semantics from equation (2).
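Proof: By induction on the structure of $r$.

\[
\begin{array}{l}
L(0) = \emptyset \text{ and } o_R((0)_w) = 0 \text{ for all } w \\[2pt]
L(1) = \{\epsilon\} \text{ and } o_R((1)_w) = 1 \Leftrightarrow w = \epsilon \\[2pt]
L(a) = \{a\} \text{ and } o_R((a)_w) = 1 \Leftrightarrow w = a \\[2pt]
w \in L(r_1 + r_2) \Leftrightarrow w \in L(r_1) \text{ or } w \in L(r_2) \\
\quad \overset{(IH)}{\Leftrightarrow} o_R((r_1)_w) = 1 \text{ or } o_R((r_2)_w) = 1 \Leftrightarrow o_R((r_1 + r_2)_w) = 1 \\[2pt]
w \in L(r_1 r_2) \Leftrightarrow w = w_1 w_2, \text{ with } w_1 \in L(r_1) \text{ and } w_2 \in L(r_2) \\
\quad \overset{(IH)}{\Leftrightarrow} o_R((r_1)_{w_1}) = 1 \text{ and } o_R((r_2)_{w_2}) = 1 \Leftrightarrow o_R((r_1 r_2)_w) = 1 \\[2pt]
w \in L(r^*) \Leftrightarrow w \in L(r)^n, \text{ for some } n \in \mathbb{N} \Leftrightarrow w = w_1 \ldots w_n \text{ with } w_i \in L(r) \\
\quad \overset{(IH)}{\Leftrightarrow} w = w_1 \ldots w_n \text{ with } o_R(r_{w_i}) = 1 \Leftrightarrow o_R((r^*)_w) = 1
\end{array}
\]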


We have now proved that the classical semantics of both deterministic 
automata and regular expressions coincides with the coalgebraic semantics. 
In the sequel, we will say that a regular expression r and a state s of a 
deterministic automaton are equivalent if L(s) = L(r). 


3 From Regular Expressions to Non-Determinis- 
tic Automata: the Berry-Sethi Construction 


There are several algorithms to construct a non-deterministic automaton 
from a regular expression. We will show here the one presented in [4] by 
Berry and Sethi. We shall generalize this algorithm in the next section in 
order to deal with the expressions of Kleene algebra with tests. The basic 
idea behind the algorithm is that of marking: all input letters in a regular 
expression are marked (with subscripts) in order to make them distinct. As
an example, a marked version of $(ab+b)^*ba$ is $(a_1 b_2 + b_3)^* b_4 a_5$,
where $a_1$ and $a_5$ are considered different letters. The subscripts we
chose are the positions of the letters in the expression. For that reason the
Berry-Sethi construction is often referred to as the position automaton.

We will explain the algorithm with an example (taken from [4]) and 
then state the results that justify its correctness. 


Example 2 Let $r = (ab+b)^*ba$ and let $\bar{r} = (a_1 b_2 + b_3)^* b_4 a_5$ be its
marked version. We define $c_i = (\bar{r})_w$, for $w$ a prefix of length $i$ of
$a_1 b_2 b_3 b_4 a_5$, and call it the continuation $i$ of $\bar{r}$. We then
construct an automaton from $\bar{r}$ in the following way:


1. The automaton will have a state $i \in \{1,2,3,4,5\}$ for each distinct
symbol in $\bar{r}$ plus an extra state $0$ that will be language equivalent⁴
to $\bar{r}$.


2. A state $i$ has a transition to state $j$, labeled by $a_j$, if
$(c_i)_{a_j} = c_j$. A state $i$ is final if $o_R(c_i) = 1$.


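The automaton resulting from $\bar{r} = (a_1 b_2 + b_3)^* b_4 a_5$ is the following

[Figure: position automaton with states 0, 1, 2, 3, 4 and 5]   (4)

\[
\begin{array}{l}
c_0 = (\bar{r})_\epsilon = (a_1 b_2 + b_3)^* b_4 a_5 \\
c_1 = (\bar{r})_{a_1} = b_2 (a_1 b_2 + b_3)^* b_4 a_5 \\
c_2 = (\bar{r})_{a_1 b_2} = (c_1)_{b_2} = (a_1 b_2 + b_3)^* b_4 a_5 \\
c_3 = (\bar{r})_{a_1 b_2 b_3} = (c_2)_{b_3} = (a_1 b_2 + b_3)^* b_4 a_5 \\
c_4 = (\bar{r})_{a_1 b_2 b_3 b_4} = (c_3)_{b_4} = a_5 \\
c_5 = (\bar{r})_{a_1 b_2 b_3 b_4 a_5} = (c_4)_{a_5} = 1
\end{array}
\]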


Note that to compute the transition structure we had to compute all input
derivatives for each $c_i$. This can be overcome by using some of the properties
of derivatives of expressions with distinct symbols (more below). Now, note
that by deleting all the marks in the labels of the automaton above the state
$0$ of the resulting NDA accepts precisely the language denoted by $(ab+b)^*ba$
(all words that finish with ba and in which all other occurrences of a are
followed by one or more b’s).

⁴ Here, by language equivalent we mean that the state $0$ recognizes the same
language that the expression denotes. More precisely, $L(\{0\}) = L(\bar{r})$.


The algorithm above works as expected due to the properties of derivatives 
of expressions with distinct letters. We summarize the crucial properties for 
the correctness of the algorithm. 


Theorem 5 ([4, Proposition 3.2 and Theorem 3.4]) Let $\bar{r}$ be the regular
expression obtained from $r$ by marking all symbols to make them distinct.
Then, the following holds:


1. If $A'$ is an automaton with a state $s$ such that $L(s) = L(\bar{r})$, then
the state $s$ of the automaton $A$, obtained from $A'$ by unmarking all the
labels, is such that $L(s) = L(r)$.


2. Given any symbol $a$ and word $w$, the derivative $(\bar{r})_{aw}$ is either
$0$ or unique modulo associativity, commutativity and idempotency.


Starting from a regular expression $r \in R(A)$, we can then obtain a
non-deterministic automaton by first marking the symbols, then applying the
algorithm above and finally unmarking the labels. If wanted, a deterministic 
automaton can then be obtained via the subset construction (the complexity 
of this construction for position automata was studied in [6]). 

In [4], the authors presented also a more efficient way of computing the 
position automaton, based on the fact that each continuation is uniquely 
determined by an input symbol. We briefly recall it here, since this is 
precisely the version we will later generalize for KAT expressions. Let pos(r) 
denote the positions (distinct symbols) in the regular expression r. For any 
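regular expression $r$ and $i \in \mathit{pos}(r)$ we define:

\[
\begin{array}{l}
\mathit{first}(r) = \{ i \mid p_i w \in L(\bar{r}),\ w \in A^*,\ p_i \in A \} \\
\mathit{follow}(r, i) = \{ j \mid u p_i p_j v \in L(\bar{r}),\ u, v \in A^*,\ p_i, p_j \in A \} \\
\mathit{last}(r) = \{ i \mid w p_i \in L(\bar{r}),\ w \in A^*,\ p_i \in A \}
\end{array}
\]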
The set $\mathit{first}(r)$ contains all indexes $i$ of the letters $p_i$ that can
appear at the beginning of a word in the language $L(\bar{r})$. For instance, in
the expression $\bar{r} = (a_1 b_2 + b_3)^* b_4 a_5$ above,
$\mathit{first}(r) = \{1,3,4\}$. Dually, $\mathit{last}(r)$ contains all indexes of
letters that can appear at the end of a word in $L(\bar{r})$. In the example,
$\mathit{last}(r) = \{5\}$. The set $\mathit{follow}(r,i)$ has all the indexes of
letters that can follow letter $p_i$ in a word in the language $L(\bar{r})$. For
instance, in our example, $\mathit{follow}(r, 2) = \{1,3,4\}$.

These sets can be computed efficiently from the expression: we recall [4, 
Proposition 4.3]. 


Proposition 1 ([4, Proposition 4.3]) Let $r$ be a regular expression with
distinct symbols. $F$, defined by the rules below, is such that $F(r, \{!\})$
yields a set of pairs of the form $(a_i, \mathit{follow}(r!, i))$, where $!$ is a
symbol distinct from all symbols in $r$. The rules are:

\[
\begin{array}{l}
F(r_1 + r_2, S) = F(r_1, S) \cup F(r_2, S) \\
F(r_1 r_2, S) = F(r_1, \mathit{first}(r_2) \cup o_R(r_2).S) \cup F(r_2, S) \\
F(r_1^*, S) = F(r_1, \mathit{first}(r_1) \cup S) \\
F(a_i, S) = \{(a_i, S)\} \\
F(1, S) = F(0, S) = \emptyset
\end{array}
\]

Here, for a set $S$, $1.S = S$ and $0.S = \emptyset$. Note that in $F$ also the set
$\mathit{last}(r)$ is computed: $i \in \mathit{last}(r) \Leftrightarrow\ ! \in \mathit{follow}(r!, i)$.


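The position automaton corresponding to a given regular expression
$r \in R(A)$ is then given by

\[
A_{pos}(r) = (\{0\} \cup \mathit{pos}(\bar{r}), (o, \delta))
\]

where $\bar{r}$ is the marked version of $r$ and $o$ and $\delta$ are defined as
follows:

\[
\begin{array}{l}
o(i) = \begin{cases} 1 & \text{if } i \in \mathit{last}(\bar{r}) \\ 0 & \text{otherwise} \end{cases} \\[6pt]
\delta(0)(a) = \{ j \mid j \in \mathit{first}(\bar{r}),\ \mathit{unmark}(a_j) = a \} \\[2pt]
\delta(i)(a) = \{ j \mid j \in \mathit{follow}(\bar{r}, i),\ \mathit{unmark}(a_j) = a \}, \quad i \neq 0
\end{array}
\]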
We show an example of the algorithm above. We consider again
$r = (ab+b)^*ba$ and its marked version $\bar{r} = (a_1 b_2 + b_3)^* b_4 a_5$.

\[
\begin{array}{ll}
\mathit{first}(\bar{r}) = \{1,3,4\} & \mathit{first}(a_1 b_2 + b_3) = \{1,3\} \\
\mathit{first}(a_1 b_2) = \{1\} & \mathit{first}(b_4 a_5) = \{4\}
\end{array}
\]

\[
\begin{array}{l}
F(\bar{r}, \{!\}) \\
\quad = F((a_1 b_2 + b_3)^*, \{4\}) \cup F(b_4 a_5, \{!\}) \\
\quad = F(a_1 b_2 + b_3, \{1,3,4\}) \cup F(b_4, \{5\}) \cup F(a_5, \{!\}) \\
\quad = F(a_1 b_2, \{1,3,4\}) \cup F(b_3, \{1,3,4\}) \cup \{(b_4, \{5\}), (a_5, \{!\})\} \\
\quad = F(a_1, \{2\}) \cup F(b_2, \{1,3,4\}) \cup \{(b_3, \{1,3,4\}), (b_4, \{5\}), (a_5, \{!\})\} \\
\quad = \{(a_1, \{2\}), (b_2, \{1,3,4\}), (b_3, \{1,3,4\}), (b_4, \{5\}), (a_5, \{!\})\}
\end{array}
\]

The position automaton $A_{pos}(r)$ constructed is the same as the one presented
above in (4).
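The whole construction (marking, first sets, the function F and the resulting automaton) fits in a short program. The following is a sketch under assumptions: regular expressions use a hypothetical tuple encoding (`("sym", a)`, `("+", r1, r2)`, `(".", r1, r2)`, `("*", r)`, `("0",)`, `("1",)`), all function names are illustrative, and state 0 is additionally treated as final when the expression accepts the empty word, a case the definition above does not spell out.

```python
END = "!"  # end marker, assumed distinct from all positions


def mark(r, counter=None):
    """Replace each symbol by ("pos", letter, i), numbering left to right."""
    counter = [0] if counter is None else counter
    tag = r[0]
    if tag == "sym":
        counter[0] += 1
        return ("pos", r[1], counter[0])
    if tag in ("0", "1"):
        return r
    return (tag,) + tuple(mark(sub, counter) for sub in r[1:])


def out(r):
    """1 iff the empty word belongs to the language of r."""
    tag = r[0]
    if tag in ("0", "pos"):
        return 0
    if tag in ("1", "*"):
        return 1
    return out(r[1]) | out(r[2]) if tag == "+" else out(r[1]) & out(r[2])


def first(r):
    """Positions that can start a word of the marked expression r."""
    tag = r[0]
    if tag == "pos":
        return {r[2]}
    if tag in ("0", "1"):
        return set()
    if tag == "+":
        return first(r[1]) | first(r[2])
    if tag == ".":
        return first(r[1]) | (first(r[2]) if out(r[1]) else set())
    return first(r[1])  # star


def follow_pairs(r, S):
    """The function F(r, S), returned as a dict position -> follow set."""
    tag = r[0]
    if tag == "pos":
        return {r[2]: set(S)}
    if tag in ("0", "1"):
        return {}
    if tag == "+":
        return {**follow_pairs(r[1], S), **follow_pairs(r[2], S)}
    if tag == ".":
        S1 = first(r[2]) | (set(S) if out(r[2]) else set())
        return {**follow_pairs(r[1], S1), **follow_pairs(r[2], S)}
    return follow_pairs(r[1], first(r[1]) | set(S))  # star


def letters(r):
    """Map each position to its unmarked letter."""
    if r[0] == "pos":
        return {r[2]: r[1]}
    result = {}
    for sub in r[1:]:
        result.update(letters(sub))
    return result


def accepts(r, word):
    """Run the position automaton of r, starting in state 0."""
    marked = mark(r)
    follow, letter = follow_pairs(marked, {END}), letters(marked)
    succ = {0: first(marked), **{i: S - {END} for i, S in follow.items()}}
    final = {i for i, S in follow.items() if END in S}
    if out(marked):  # assumption: state 0 is final iff r accepts the empty word
        final.add(0)
    current = {0}
    for a in word:
        current = {j for i in current for j in succ[i] if letter[j] == a}
    return bool(current & final)


# (ab + b)*ba in the assumed encoding.
R_EXAMPLE = (".",
             ("*", ("+", (".", ("sym", "a"), ("sym", "b")), ("sym", "b"))),
             (".", ("sym", "b"), ("sym", "a")))
```

On `R_EXAMPLE`, `first(mark(R_EXAMPLE))` and `follow_pairs(mark(R_EXAMPLE), {END})` give the same sets as the worked computation above.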

It should be remarked that the construction of the position automaton 
from a regular expression does not always extend to additional operators, 
such as intersection or complement. This is a disadvantage when compared, 
for instance, to the algorithm based on Brzozowski derivatives. 


4 Automata on Guarded Strings and KAT Expres- 
sions 


Kleene algebra with tests (KAT) is an equational system that combines Kleene 
and Boolean algebra. One can model basic programming constructs and 
assertions in KAT, which allows for its application in compiler optimization, 
program transformation or dataflow analysis [15, 1, 13]. In this section, we 
will recall the basic definitions of KAT and we will show how to generalize 
the Berry-Sethi construction (Section 3) in order to (efficiently) obtain an 
automaton from a KAT expression. 


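Definition 1 (Kleene algebra with tests) A Kleene algebra with tests
is a two-sorted structure $(\Sigma, B, +, \cdot, (-)^*, \overline{\phantom{b}}, 0, 1)$ where


• $(\Sigma, +, \cdot, (-)^*, 0, 1)$ is a Kleene algebra,


• $(B, +, \cdot, \overline{\phantom{b}}, 0, 1)$ is a Boolean algebra, and


• $(B, +, \cdot, \overline{\phantom{b}}, 0, 1)$ is a subalgebra of $(\Sigma, +, \cdot, (-)^*, 0, 1)$.


The operator $\overline{\phantom{b}}$ denotes negation.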
Given a set $P$ of (primitive) action symbols and a set $B$ of (primitive)
test symbols, we can define the free Kleene algebra with tests on generators
$P \cup B$ as follows. Syntactically, the set $BExp$ of Boolean tests is given by:

\[
BExp \ni b ::= b \in B \mid b_1 b_2 \mid b_1 + b_2 \mid \overline{b} \mid 0 \mid 1
\]

The set of KAT expressions is given by

\[
Exp \ni e, f ::= p \in P \mid b \in BExp \mid ef \mid e + f \mid e^*
\]

The free Kleene algebra with tests on generators $P \cup B$ is obtained by
quotienting $BExp$ by the axioms of Boolean algebra and $Exp$ by the axioms
of Kleene algebra. We will use below the natural order on expressions:
$e \leq f \Leftrightarrow e + f \equiv f$, where $\equiv$ denotes equivalence
provable using the axioms of Kleene/Boolean algebra (these include axioms for
idempotency, associativity and commutativity of $+$, among others; for more
details see, for instance, [12]).

Guarded strings were introduced in [10] as an abstract interpretation 
for program schemes. They are like ordinary strings over an input alphabet 
$P$, but the symbols in $P$ alternate with the atoms of the free Boolean algebra
generated by $B$. The set $At$ of atoms is given by $At = 2^B$. We define the set
$GS$ of guarded strings by

\[
GS = (At \times P)^* At
\]


Kozen [12] showed that the regular sets of guarded strings play the same role 
in KAT as regular languages play in Kleene algebra (both sets are actually 
the final coalgebra of a given functor). He showed an analogue of Kleene’s 
theorem: automata on guarded strings, which are non-deterministic automata
over the alphabet $P \cup B$, recognize precisely the regular sets of guarded strings.


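Definition 2 (Regular sets of guarded strings) Each KAT expression
$e$ denotes a set $G(e)$ of guarded strings defined inductively on the structure
of $e$ as follows:

\[
\begin{array}{rcl}
G(p) & = & \{ \alpha p \beta \mid \alpha, \beta \in At \} \\
G(b) & = & \{ \alpha \mid \alpha \leq b \} \\
G(e + f) & = & G(e) \cup G(f) \\
G(ef) & = & G(e) \diamond G(f) \\
G(e^*) & = & \bigcup_{n \geq 0} G(e)^n
\end{array}
\]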




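where, given two guarded strings $x = \alpha_0 p_0 \cdots p_{n-1} \alpha_n$ and
$y = \beta_0 q_0 \cdots q_{m-1} \beta_m$, we define the fusion product of $x$ and
$y$ by $x \diamond y = \alpha_0 p_0 \cdots p_{n-1} \alpha_n q_0 \cdots q_{m-1} \beta_m$,
if $\alpha_n = \beta_0$; otherwise $x \diamond y$ is undefined. Then, given
$X, Y \subseteq GS$, $X \diamond Y$ is the set containing all existing fusion
products $x \diamond y$ of $x \in X$ and $y \in Y$, and $X^n$ is defined
inductively as $X^0 = At$ and $X^{n+1} = X \diamond X^n$.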

A set of guarded strings is regular if it is equal to $G(e)$ for some KAT
expression $e$. Note that a guarded string $x$ is itself a KAT expression and

\[
G(x) = \{x\}
\]


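Example 3 Consider the KAT expression $e = b_1 + b_2 p$ over $B = \{b_1, b_2\}$
and $P = \{p\}$. We compute the set $G(e)$:

\[
\begin{array}{rcl}
G(e) & = & G(b_1) \cup (G(b_2) \diamond G(p)) \\
 & = & \{ \alpha \mid \alpha \leq b_1 \} \cup (\{ \alpha \mid \alpha \leq b_2 \} \diamond \{ \alpha p \beta \mid \alpha, \beta \in At \}) \\
 & = & \{ \alpha \mid \alpha \leq b_1 \} \cup \{ \alpha p \beta \mid \alpha \leq b_2, \beta \in At \}
\end{array}
\]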

We will now show an example of an automaton on guarded strings. As
mentioned above, such an automaton is just a non-deterministic automaton over
the alphabet $A = P \cup B$, that is $(S, (o_S, t_S))$ with $o_S\colon S \to 2$ and
$t_S\colon S \to \mathcal{P}(S)^A$. State $s_0$ of the following automaton would
recognize (we shall explain the precise meaning of this below) the same set of
guarded strings as $e$:


Let us now explain how to compute $G(s)$, the set of guarded strings accepted
by a state $s$ of an automaton $\mathcal{A}$ on guarded strings. A guarded string
$x$ is accepted by $s$ if $x \in G(e)$ for some $e \in L(s)$, where
$L(s) \subseteq (P \cup B)^*$ is just the language accepted by $s$, as defined
classically for non-deterministic automata and explained coalgebraically in
Section 2.2. In the example above, we have $L(s_0) = \{b_1, b_2 p\}$ and thus

\[
G(s_0) = G(b_1) \cup G(b_2 p) = G(b_1 + b_2 p).
\]


Later in [14], Kozen showed that the deterministic version of automata
on guarded strings (already defined in [12]) fits neatly in the coalgebraic
framework: two expressions are bisimilar if and only if they recognize the
same set of guarded strings.

A deterministic automaton on guarded strings is a pair $(S, (o_S, t_S))$
where $o_S\colon S \to \mathcal{B}$ (recall that $\mathcal{B}$ is the free Boolean
algebra on $B$, satisfying $\mathcal{B} \cong 2^{At}$) and
$t_S\colon S \to S^{At \times P}$. Note that formally a deterministic automaton
on guarded strings is a Moore automaton (see Section 2.2).

We can obtain a deterministic automaton by using the following gener- 
alization of Brzozowski derivatives for KAT expressions (modulo ACI, as for 
classical regular expressions). 


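Definition 3 (Brzozowski derivatives for KAT expressions) Given an
expression $e \in Exp$, we define $E\colon Exp \to \mathcal{B} \cong 2^{At}$ and
$D\colon Exp \to Exp^{At \times P}$ by induction on the structure of $e$. First,
$E(e)$ is given by:

\[
\begin{array}{lll}
E(p) = \emptyset & E(b) = \{ \alpha \in At \mid \alpha \leq b \} & E(ef) = E(e) \cap E(f) \\
E(e + f) = E(e) \cup E(f) & E(e^*) = At &
\end{array}
\]

Next, we define $e_{\alpha q} = D(e)((\alpha, q))$ by

\[
\begin{array}{ll}
p_{\alpha q} = \begin{cases} 1 & \text{if } p = q \\ 0 & \text{if } p \neq q \end{cases}
& (ef)_{\alpha q} = \begin{cases} e_{\alpha q} f + f_{\alpha q} & \text{if } \alpha \in E(e) \\ e_{\alpha q} f & \text{if } \alpha \notin E(e) \end{cases} \\[8pt]
b_{\alpha q} = 0 & (e + f)_{\alpha q} = e_{\alpha q} + f_{\alpha q} \qquad (e^*)_{\alpha q} = e_{\alpha q} e^*
\end{array}
\]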
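The functions E and D give a direct membership test for guarded strings: derive the expression along the (atom, action) pairs of the string and check whether the final atom belongs to E of the result. The sketch below is illustrative only and rests on assumptions: atoms are frozensets of primitive tests, Boolean tests are predicates on atoms, a guarded string is a tuple alternating atoms and actions, and the helper names (`prim_test`, `accepts`) are hypothetical. No simplification modulo ACI is performed.

```python
ZERO = ("test", lambda atom: False)
ONE = ("test", lambda atom: True)


def prim_test(name):
    """Primitive test: holds for the atoms that contain its name."""
    return ("test", lambda atom: name in atom)


def E(e, atom):
    """True iff the atom belongs to E(e)."""
    tag = e[0]
    if tag == "act":
        return False
    if tag == "test":
        return e[1](atom)
    if tag == "+":
        return E(e[1], atom) or E(e[2], atom)
    if tag == ".":
        return E(e[1], atom) and E(e[2], atom)
    return True  # star: E(e*) = At


def D(e, atom, q):
    """Derivative of e for the input pair (atom, q)."""
    tag = e[0]
    if tag == "act":
        return ONE if e[1] == q else ZERO
    if tag == "test":
        return ZERO
    if tag == "+":
        return ("+", D(e[1], atom, q), D(e[2], atom, q))
    if tag == ".":
        left = (".", D(e[1], atom, q), e[2])
        return ("+", left, D(e[2], atom, q)) if E(e[1], atom) else left
    return (".", D(e[1], atom, q), e)  # star


def accepts(e, gs):
    """gs = (alpha_0, p_1, alpha_1, ..., p_n, alpha_n) as a tuple."""
    for i in range(0, len(gs) - 1, 2):
        e = D(e, gs[i], gs[i + 1])
    return E(e, gs[-1])


B1, B2, P_ACT = prim_test("b1"), prim_test("b2"), ("act", "p")
E_EXAMPLE = ("+", B1, (".", B2, P_ACT))   # b1 + b2 p
LOOP = ("*", (".", B1, P_ACT))            # (b1 p)*
```

For `E_EXAMPLE` the accepted guarded strings are the atoms below b1 and the strings alpha p beta with alpha below b2, in line with Example 3.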


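The functions $(E, D)$ provide $Exp$ with a deterministic (Moore) automaton
structure, which leads, by finality (Theorem 3), to the existence of a unique
homomorphism

\[
\begin{array}{ccc}
Exp & \xrightarrow{\;G\;} & 2^{GS} \cong (2^{At})^{(At \times P)^*} \\
{\scriptstyle (E,D)}\downarrow & & \downarrow{\scriptstyle (o_{GS},t_{GS})} \\
2^{At} \times Exp^{At \times P} & \xrightarrow{\qquad} & 2^{At} \times (2^{GS})^{At \times P}
\end{array}
\]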

which assigns to each expression the language of guarded strings that it
denotes. The coalgebra structure on $2^{GS}$ is an instantiation of the final
coalgebra structure presented before Theorem 3 for the output set $B = 2^{At}$
and input set $A = At \times P$. More precisely, we have


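$$
\begin{array}{l}
o_{GS}(f \in (2^{At})^{(At \times P)^*}) = f(\epsilon) \\
t_{GS}(f)((\alpha, p))(w) = f((\alpha, p) w)
\end{array}
$$

or, alternatively, and equivalently,

$$
\begin{array}{l}
o_{GS}(L \in 2^{GS}) = \{\alpha \in At \mid \alpha \in L\} \\
t_{GS}(L)((\alpha, p)) = \{w \in (At \times P)^* At \mid (\alpha, p) w \in L\}
\end{array}
$$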
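For a finite set of guarded strings, the set-based reading of the output and transition maps can be sketched in a few lines. The encoding is an assumption made for illustration only: a guarded string is a tuple alternating atoms and actions, atoms are opaque strings such as `"a1"`, and the function names `o_gs` and `t_gs` are hypothetical.

```python
def o_gs(L):
    """Atoms alpha such that the one-letter guarded string alpha is in L."""
    return {x[0] for x in L if len(x) == 1}

def t_gs(L, alpha, p):
    """All guarded strings w such that (alpha, p) followed by w is in L."""
    return {x[2:] for x in L if len(x) >= 3 and x[0] == alpha and x[1] == p}

# A small, made-up set of guarded strings over atoms a1, a2 and actions p, q.
L = {("a1",), ("a1", "p", "a2"), ("a2", "p", "a1"), ("a1", "q", "a1")}
```

Here `o_gs(L)` collects the atoms that are accepted immediately, and `t_gs(L, "a1", "p")` is the residual set after reading the atom `a1` and the action `p`.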
The concrete definition of $G$, which can be deduced from the commutativity
of the diagram above, is precisely the definition which appeared in the
original paper on guarded strings and which we recalled in Definition 2.


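Example 4 The deterministic automaton of $e = b_1 + b_2 p$, which is the
deterministic counterpart of the automaton in Example 3, would be

[Figure: deterministic automaton with states $e$ (output $b_1$), $1$ (output
$1$) and $0$ (output $0$); the transitions from $e$ labelled $(b_1 b_2, p)$ and
$(\bar{b}_1 b_2, p)$ lead to $1$, those labelled $(b_1 \bar{b}_2, p)$ and
$(\bar{b}_1 \bar{b}_2, p)$ lead to $0$, and all transitions $(\alpha, p)$ from
$1$ and $0$ lead to $0$.]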


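since, for $B = \{b_1, b_2\}$, $At = \{b_1 b_2, b_1 \bar{b}_2, \bar{b}_1 b_2, \bar{b}_1 \bar{b}_2\}$ and

$$
\begin{array}{ll}
e_{b_1 b_2, p} = 0 + (b_2 p)_{b_1 b_2, p} = p_{b_1 b_2, p} = 1
& e_{\bar{b}_1 b_2, p} = 0 + (b_2 p)_{\bar{b}_1 b_2, p} = p_{\bar{b}_1 b_2, p} = 1 \\
e_{b_1 \bar{b}_2, p} = 0 + (b_2 p)_{b_1 \bar{b}_2, p} = 0
& e_{\bar{b}_1 \bar{b}_2, p} = 0 + (b_2 p)_{\bar{b}_1 \bar{b}_2, p} = 0
\end{array}
$$

$$E(e) = \{\alpha \mid \alpha \leq b_1\} = \{b_1 b_2, b_1 \bar{b}_2\} \qquad E(0) = \emptyset \qquad E(1) = At$$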


Above we represent the output $o_S(s)$ of a state by a label $b$, where
$b \in \mathcal{B}$ is the element corresponding to the set $o_S(s)$ under the
isomorphism $2^{At} \cong \mathcal{B}$.


5 The Berry-Sethi Construction for KAT Expressions


In short, there are two types of automata recognizing regular sets of guarded
strings:

$$S \to 2 \times (\mathcal{P}(S))^{P \cup B} \qquad\qquad S \to \mathcal{B} \times S^{At \times P}$$


The non-deterministic version has the advantage that it is very close to 
the expression, that is, one can easily compute the automaton from a
given KAT expression and back, but its semantics is not coalgebraic. The
deterministic version fits neatly into the coalgebraic framework, but it has the 
disadvantage that constructing the automaton from an expression inherits the 
same problems as in the Brzozowski construction: the number of equivalences 
that need to be decided increases exponentially. We propose here yet another 
type of automaton to recognize guarded strings: the construction from an 
expression to an automaton will be inspired by the Berry-Sethi construction 
presented in Section 3 and it is linear in the size of the expression. 

Note that since KAT expressions can be interpreted as regular expressions
over the extended alphabet $B \cup P$, the Berry-Sethi construction could be
applied directly.


Theorem 6 Let $e$ be a KAT expression and $A_{pos}(e)$ be the corresponding
position automaton. Then, $G(e) = G(A_{pos}(e))$.


Proof: We know that $L(A_{pos}(e)) = L(e)$. Now the result follows by using
Kozen’s observation in [12] that given a KAT expression $e$ and an automaton
$A$ such that $L(A) = L(e)$, one has $G(e) = G(A)$.

The resulting automaton would have precisely the same type as the
non-deterministic version of automata on guarded strings. However, there
would be one state for each occurrence of a symbol of $P \cup B$ in the
expression.


Example 5 As an example take the expression $bp + c$, whose marked version
is $b_1 p_2 + c_3$. The resulting position automaton will have 4 states and the
transition function will be given by the following table

    state   follow
    0       {1, 3}
    1       {2}
    2       {!}
    3       {!}

which results in the following automaton:

[Figure: position automaton for $b_1 p_2 + c_3$.]

Let us now present a more involved example. Consider the expression
saq(Graq)*az. The automaton will have 10 states, one for each symbol in
the expression.

    state   follow
    0       {1}
    1       {2}
    2       {3}
    3       {4, 8}
    4       {5}
    5       {6}
    6       {7}
    7       {4, 8}
    8       {9}
    9       {!}

[Figure: the resulting position automaton.]


As this last example shows, if we just apply the Berry-Sethi algorithm directly 
to a KAT expression, without distinguishing between tests and actions, the 
number of states increases very fast. 
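The direct application of the Berry-Sethi construction can be sketched as follows. This is the standard textbook formulation of the follow computation, assumed here to agree with the one recalled in Section 3; the tuple encoding of marked expressions and the function names are illustrative choices, and tests and actions are deliberately not distinguished.

```python
def nullable(e):
    """True if the empty word belongs to the language of e."""
    tag = e[0]
    if tag == "sym":
        return False
    if tag == "plus":
        return nullable(e[1]) or nullable(e[2])
    if tag == "seq":
        return nullable(e[1]) and nullable(e[2])
    return True  # star

def first(e):
    """Positions that can start a word of e."""
    tag = e[0]
    if tag == "sym":
        return {e[1]}
    if tag == "plus":
        return first(e[1]) | first(e[2])
    if tag == "seq":
        return first(e[1]) | (first(e[2]) if nullable(e[1]) else set())
    return first(e[1])  # star

def follow(e, S):
    """Map each position of e to its follow set, given the follow set S of e."""
    tag = e[0]
    if tag == "sym":
        return {e[1]: set(S)}
    if tag == "plus":
        return {**follow(e[1], S), **follow(e[2], S)}
    if tag == "seq":
        S1 = first(e[2]) | (set(S) if nullable(e[2]) else set())
        return {**follow(e[1], S1), **follow(e[2], S)}
    return follow(e[1], first(e[1]) | set(S))  # star

# Marked expression b_1 p_2 + c_3; positions are 1, 2, 3 and "!" is the end-marker.
e = ("plus", ("seq", ("sym", 1), ("sym", 2)), ("sym", 3))
table = {0: first(e), **follow(e, {"!"})}
```

The dictionary `table` has one entry per state of the position automaton, which makes the growth in the number of states visible: every test occurrence contributes a position of its own.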

The construction we will show next introduces states only for the atomic
actions in $P$, yielding smaller automata. From a given KAT expression $e$, we
will construct an automaton $(S, t)$ where
$t: S \to \mathcal{B} \times \mathcal{P}(S)^{\mathcal{B} \times P}$. This
automaton type can be regarded as a compromise between the non-deterministic
and deterministic versions of Kozen’s automata.

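We start by generalizing the sets first, follow and last.

$$
\begin{array}{lcl}
first(e) & = & \{(b, p) \mid b_1 b_2 \ldots b_n p x \in L(e),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\} \\
follow(e, p) & = & \{(b, q) \mid x p b_1 b_2 \ldots b_n q y \in L(e),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\} \\
last(e) & = & \{(b, p) \mid x p b_1 b_2 \ldots b_n \in L(e),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\}
\end{array}
$$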


Note that the empty conjunction is 1 (and the empty disjunction is 0).
Below, we will use expressions of the form $e!$, where $!$ is a special
end-marker, to avoid the computation of the last symbols that can be
generated in $e$:
$(b, p) \in last(e) \iff (b, !) \in follow(e!, p)$.

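Given a KAT expression $e$ with all action symbols distinct we construct
the automaton $A_e = (Pos(e) \cup \{0\}, \langle o_S, t_S \rangle)$ where
$Pos(e) = \{1, \ldots, n\}$, with $n$ the number of distinct action symbols
in $e$, and

$$
o_S(i) = \begin{cases}
E(e) & \text{if } i = 0 \\
b & \text{if } i > 0 \text{ and } (b, !) \in follow(e!, p_i) \\
0 & \text{otherwise}
\end{cases}
$$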


and $t$ is given by the following rules

$$
\begin{array}{lcl}
0 \xrightarrow{(b, p_i)} i & \iff & (b, p_i) \in first(e) \\[1ex]
i \xrightarrow{(b, p_j)} j & \iff & (b, p_j) \in follow(e!, p_i)
\end{array}
$$


The way the automaton is defined, state $i$ will only have incoming transitions
labeled by $(b, p_i)$. Moreover, the fact that $e$ has distinct symbols implies
that the constructed automaton is deterministic, that is,
$t: S \to \mathcal{B} \times S^{\mathcal{B} \times P}$.
Non-determinism is introduced only after unmarking the labels $p_i$, as
we will observe in an example below.
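Once the sets first and follow are known, assembling the automaton is mechanical. The sketch below assumes that tests are kept as opaque strings, that positions are integers, that `"!"` stands for the end-marker, and that the function name `build_automaton` is hypothetical. The input data is the one obtained for the marked expression $b_1(p_1 q_2 b_2 + p_3 p_4 b_3 + b_4)$ treated in Example 6.

```python
def build_automaton(E_e, first, follow):
    """Assemble outputs and transitions from first(e) and follow(e!, p_i).

    first  : set of pairs (b, i)
    follow : dict mapping position i to a set of pairs (b, j), where j is a
             position or the end-marker "!"
    """
    states = [0] + sorted(follow)
    out = {0: E_e}
    trans = {}
    for (b, j) in first:
        trans[(0, (b, j))] = j          # 0 --(b, p_j)--> j
    for i, successors in follow.items():
        out[i] = "0"                    # default output: the test 0
        for (b, j) in successors:
            if j == "!":
                out[i] = b              # (b, !) in follow(e!, p_i)
            else:
                trans[(i, (b, j))] = j  # i --(b, p_j)--> j
    return states, out, trans

first = {("b1", 1), ("b1", 3)}
follow = {1: {("1", 2)}, 2: {("b2", "!")}, 3: {("1", 4)}, 4: {("b3", "!")}}
states, out, trans = build_automaton("b1b4", first, follow)
```

Each transition is stored under the key (source state, label), so the marked automaton is visibly deterministic: a label $(b, p_j)$ determines its target $j$.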

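The guarded strings recognized by a state $s \in S$ of the automaton $(S, t)$
where $t: S \to \mathcal{B} \times \mathcal{P}(S)^{\mathcal{B} \times P}$ are
now defined by the following rule

$$
\begin{array}{lcl}
x \in G(s) & \iff & x = \alpha \text{ with } \alpha \leq o_S(s) \\
& & \text{or } x = \alpha p x' \text{ with } x' \in G(s') \text{ for some } s' \in t_S(s)((b, p)) \\
& & \text{and for some } b \text{ s.t. } \alpha \leq b
\end{array}
$$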
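The acceptance rule translates directly into a recursive membership check. In this sketch an element of $\mathcal{B}$ is represented by the set of atoms below it, so that $\alpha \leq b$ becomes set membership; atoms are written as two-character strings giving the truth values of $b_1$ and $b_2$. These encodings and the name `accepts` are assumptions made for illustration.

```python
def accepts(out, trans, s, x):
    """Decide whether the guarded string x (a tuple) belongs to G(s).

    out   : dict state -> set of atoms (the output test of the state)
    trans : dict state -> list of (b, p, target), with b a set of atoms
    """
    alpha = x[0]
    if len(x) == 1:
        # x = alpha with alpha <= output of s
        return alpha in out[s]
    p, rest = x[1], x[2:]
    # x = alpha p x' with x' in G(s') for a transition (b, p) with alpha <= b
    return any(alpha in b and q == p and accepts(out, trans, s2, rest)
               for (b, q, s2) in trans.get(s, ()))

# Automaton for b1 + b2 p over B = {b1, b2}; atom "10" means b1 true, b2 false.
ONE = {"11", "10", "01", "00"}
b1 = {"11", "10"}
b2 = {"11", "01"}
out = {0: b1, 1: ONE}
trans = {0: [(b2, "p", 1)]}
```

For instance, `accepts(out, trans, 0, ("01", "p", "00"))` checks the guarded string $\bar{b}_1 b_2 \; p \; \bar{b}_1 \bar{b}_2$ against state 0.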


Theorem 7 Let $e$ be a KAT expression, with all action symbols distinct, and
let $A_e = (Pos(e) \cup \{0\}, \langle o_S, t_S \rangle)$ be the corresponding
automaton constructed as above. Then, $G(e) = G(0)$.


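Proof: By induction on the structure of $e$.
If $e = b$ then $G(b) = \{\alpha \mid \alpha \leq b\}$ and $A_e$ is a one state
automaton with no transitions. Thus,
$G(0) = \{\alpha \mid \alpha \leq E(b)\} = \{\alpha \mid \alpha \leq b\} = G(b)$.
If $e = p$ then $G(p) = \{\alpha p \beta \mid \alpha, \beta \in At\}$ and $A_e$
is a two state automaton with only one transition from state 0 (with output
$E(p) = 0$) to state 1 (with output 1) labeled by $(1, p)$. Thus,

$$G(0) = \{\alpha \mid \alpha \leq E(p)\} \cup \{\alpha p \beta \mid \alpha \leq 1, \beta \leq 1\} = \{\alpha p \beta \mid \alpha, \beta \in At\} = G(p)$$

For $e = e_1 + e_2$, we have

$$
\begin{array}{cl}
& G(0) \\
= & \{\alpha \mid \alpha \leq E(e_1 + e_2)\} \cup \{\alpha p x' \mid x' \in G(t_S(0)((b, p))),\ \alpha \leq b\} \\
= & \{\alpha \mid \alpha \leq E(e_1)\} \cup \{\alpha \mid \alpha \leq E(e_2)\} \ \cup \\
& \{\alpha p x' \mid x' \in G(t_S(0)((b, p))),\ \alpha \leq b,\ b_1 b_2 \ldots b_n p x \in L(e_1 + e_2),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\} \\
= & \{\alpha \mid \alpha \leq E(e_1)\} \cup \{\alpha \mid \alpha \leq E(e_2)\} \ \cup \\
& \{\alpha p x' \mid x' \in G(t_S(0)((b, p))),\ \alpha \leq b,\ b_1 b_2 \ldots b_n p x \in L(e_1),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\} \ \cup \\
& \{\alpha p x' \mid x' \in G(t_S(0)((b, p))),\ \alpha \leq b,\ b_1 b_2 \ldots b_n p x \in L(e_2),\ b = \bigvee (b_1 \wedge b_2 \wedge \ldots \wedge b_n)\} \\
= & G(e_1) \cup G(e_2) \\
= & G(e_1 + e_2)
\end{array}
$$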


Note that $b_1 b_2 \ldots b_n p x \in L(e_i)$, for $i = 1, 2$, if and only if
the state 0 of the automaton $A_{e_i}$ has a transition labeled by $(b, p)$
into some state.

For $e = e_1 e_2$, things get slightly more complicated. Let us start with
the easy bit:

$$\alpha \in G(0) \iff \alpha \leq E(e_1 e_2) \iff \alpha \leq E(e_1) \text{ and } \alpha \leq E(e_2) \iff \alpha \in G(e_1 e_2)$$


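Now take $\alpha_1 p_1 \ldots p_{n-1} \alpha_n \in G(0)$. This means that there
exists a sequence of transitions:

$$0 \xrightarrow{(b_1, p_1)} \bullet \xrightarrow{(b_2, p_2)} \bullet \cdots \bullet \xrightarrow{(b_{n-1}, p_{n-1})} \bullet \quad (\text{with output } b_n)$$

such that $\alpha_i \leq b_i$, for all $i = 1, \ldots, n$. Because all the
symbols in $e_1 e_2$ are distinct we can divide the above sequence of
transitions as follows. Let $p_k$ be the last action symbol belonging to
$e_1$. We have

$$
\begin{array}{l}
0 \xrightarrow{(b_1, p_1)} \bullet \xrightarrow{(b_2, p_2)} \cdots \xrightarrow{(b_k, p_k)} \bullet \\[1ex]
\bullet \xrightarrow{(b_{k+1}, p_{k+1})} \bullet \xrightarrow{(b_{k+2}, p_{k+2})} \bullet \xrightarrow{(b_{k+3}, p_{k+3})} \cdots \xrightarrow{(b_{n-1}, p_{n-1})} \bullet \quad (\text{with output } b_n)
\end{array}
$$

and we observe that $b_{k+1}$ is such that
$x p_k b_{k+1} p_{k+1} y \in L(e_1 e_2)$. Thus,
$b_{k+1} = b'_{k+1} b''_{k+1}$ such that $x p_k b'_{k+1} \in L(e_1)$ and
$b''_{k+1} p_{k+1} y \in L(e_2)$ and, as a consequence,

$$(b'_{k+1}, p_k) \in last(e_1) \quad \text{and} \quad (b''_{k+1}, p_{k+1}) \in first(e_2)$$

Now we can conclude using the induction hypothesis since
$\alpha_1 p_1 \ldots \alpha_k p_k \alpha_{k+1} \in G(0_1)$, where $0_1$ is the
state 0 of $A_{e_1}$, and
$\alpha_{k+1} p_{k+1} \ldots \alpha_{n-1} p_{n-1} \alpha_n \in G(0_2)$, where
$0_2$ denotes the state 0 of $A_{e_2}$, and therefore:

$$
\begin{array}{l}
\alpha_1 p_1 \ldots \alpha_k p_k \alpha_{k+1} \in G(e_1) \text{ and } \alpha_{k+1} p_{k+1} \ldots \alpha_{n-1} p_{n-1} \alpha_n \in G(e_2) \\
\Rightarrow \alpha_1 p_1 \ldots \alpha_k p_k \alpha_{k+1} p_{k+1} \ldots \alpha_{n-1} p_{n-1} \alpha_n \in G(e_1 e_2)
\end{array}
$$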


The case $e^*$ follows a similar reasoning as in $e_1 e_2$ and is left to the
reader.

This theorem refers to marked expressions. Note, however, that unmarking
the labels of the automaton only changes the action symbols and it
will also yield $G(\overline{0}) = \overline{G(0)}$, where $G(\overline{0})$
denotes the set of guarded strings recognized by state 0 of the unmarked
automaton and $\overline{G(0)}$ the unmarking of the set of guarded strings
recognized by state 0 of the marked automaton.

Next, we present an algorithm to compute the sets first, follow and 
last for KAT expressions. 


Proposition 2 Let $e$ be a KAT expression with distinct symbols. $F$, defined
by the rules below, is such that $F(e, \{(1, !)\})$ yields a set of pairs of
the form $(p_i, follow(e!, p_i))$. The rules are:

$$
\begin{array}{lcl}
F(e_1 + e_2, S) & = & F(e_1, S) \cup F(e_2, S) \\
F(e_1.e_2, S) & = & F(e_1, first(e_2) \cup E(e_2).S) \cup F(e_2, S) \\
F(e_1^*, S) & = & F(e_1, first(e_1) \cup S) \\
F(p, S) & = & \{(p, S)\} \\
F(b, S) & = & \emptyset
\end{array}
$$

where

$$
\begin{array}{lcl}
first(e_1 + e_2) & = & first(e_1) \cup first(e_2) \\
first(e_1.e_2) & = & first(e_1) \cup E(e_1).first(e_2) \\
first(e_1^*) & = & first(e_1) \\
first(p) & = & \{(1, p)\} \\
first(b) & = & \emptyset
\end{array}
$$


Note the similarities between Propositions 2 and 1 (the proofs are also
similar and hence we do not include them here). The fact that the Boolean
algebra $\mathcal{B}$ generalizes the two element Boolean algebra of classical
regular expressions is reflected in the clause for the concatenation in the
following way. The test for the empty word is replaced by the Boolean value
$E(e)$ of a KAT expression $e$ and the multiplication is now redefined to
propagate the tests:


$$
b.S = \begin{cases}
\emptyset & \text{if } b = 0 \\
\{(bb', p) \mid (b', p) \in S\} & \text{otherwise}
\end{cases}
$$


Example 6 We now show two examples of the algorithm above. We start
by applying it to the expression $e = b_1 + b_2 p$, which we already used in
Examples 3 and 4. This expression already has all action symbols distinct so
no marking is needed. First, we compute $F(e, \{(1, !)\})$:

$$
\begin{array}{lcl}
F(b_1 + b_2 p, \{(1, !)\}) & = & F(b_1, \{(1, !)\}) \cup F(b_2 p, \{(1, !)\}) \\
& = & F(p, \{(1, !)\}) \\
& = & \{(p, \{(1, !)\})\}
\end{array}
$$

Thus, because $first(e) = \{(b_2, p)\}$ and $E(e) = b_1$, $A_e$ is given by

[Figure: automaton $A_e$ with state 0 (output $b_1$), state 1 (output $1$) and
a single transition from 0 to 1 labelled $(b_2, p)$.]


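Next, we consider the expression $e_1 = b_1(pqb_2 + ppb_3 + b_4)$. We have
$E(e_1) = b_1 b_4$, $\overline{e_1} = b_1(p_1 q_2 b_2 + p_3 p_4 b_3 + b_4)$ and

$$
\begin{array}{cl}
& first(\overline{e_1}) \\
= & first(b_1) \cup E(b_1).first(p_1 q_2 b_2 + p_3 p_4 b_3 + b_4) \\
= & b_1.\{(1, p_1), (1, p_3)\} = \{(b_1, p_1), (b_1, p_3)\}
\end{array}
$$

$$
\begin{array}{cl}
& F(\overline{e_1}, \{(1, !)\}) \\
= & F(p_1 q_2 b_2 + p_3 p_4 b_3 + b_4, \{(1, !)\}) \\
= & F(p_1 q_2 b_2, \{(1, !)\}) \cup F(p_3 p_4 b_3, \{(1, !)\}) \\
= & F(p_1, \{(1, q_2)\}) \cup F(q_2, E(b_2).\{(1, !)\}) \cup F(p_3, \{(1, p_4)\}) \cup F(p_4, E(b_3).\{(1, !)\}) \\
= & \{(p_1, \{(1, q_2)\}), (q_2, \{(b_2, !)\}), (p_3, \{(1, p_4)\}), (p_4, \{(b_3, !)\})\}
\end{array}
$$

The automaton $A_{e_1}$, after unmarking, is then given by:

[Figure: automaton with five states. State 0 has output $b_1 b_4$ and two
outgoing transitions labelled $(b_1, p)$, one to state 1 and one to state 3,
both with output 0. State 1 has a transition labelled $(1, q)$ to state 2
(output $b_2$), and state 3 has a transition labelled $(1, p)$ to state 4
(output $b_3$).]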


The non-deterministic version of Kozen’s automata on guarded strings would
have 7 states and 8 transitions, whereas the (minimal) deterministic version
would have 5 states (same as the automaton above), but $8 \times 8 = 64$
transitions since for $B = \{b_1, b_2, b_3\}$ the set $At$ has 8 elements.


6 Conclusion 


This paper contains an exercise on KAT expressions. First, we show that 
one can compile a non-deterministic automaton (on guarded strings) from 
a KAT expression by directly applying the Berry-Sethi construction, a very 
efficient algorithm for classical regular expressions. Secondly, we present a 
new automata model for KAT expressions, which can be seen as a compromise 
between Kozen’s deterministic and non-deterministic models. We then adapt 
the Berry-Sethi construction to the new model. Compiling KAT expressions
into the new model yields automata with fewer states, an important feature
for certain applications in program verification. The constructed automata
do, however, have more transitions, and it remains to be explored how this
affects the efficiency of the construction in practice.